IEEE/ACM Transactions on Computational Biology and Bioinformatics
● Institute of Electrical and Electronics Engineers (IEEE)
Preprints posted in the last 90 days, ranked by how well they match the content profile of IEEE/ACM Transactions on Computational Biology and Bioinformatics, based on 32 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Fletcher, W. L.; Sinha, S.
The practice of identifying biomarkers and developing prognostic models from genomic data has become increasingly prevalent. Such data often feature characteristics that make these tasks difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance on diverse right-censored time-to-event data (also known as survival data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily to compare their variable selection capability, and secondarily their survival time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also applied these methods in multiple analyses of a publicly available and widely used cancer cohort from The Cancer Genome Atlas. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the adaptive LASSO performed well on all metrics, and the LASSO and elastic net excelled on concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performance was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best-performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.
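As a reader's aid, here is a minimal, pure-Python sketch of Harrell's concordance index, one of the evaluation metrics used in this benchmark (pairs with tied event times are simply skipped for brevity; this is not the authors' code):

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed times (event or censoring)
    events: 1 if the event occurred, 0 if censored
    risks:  predicted risk scores (higher = earlier expected event)
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # order the pair so i has the shorter observed time
        if times[i] > times[j]:
            i, j = j, i
        # the pair is comparable only if the earlier time is an actual event
        if times[i] == times[j] or not events[i]:
            continue
        comparable += 1
        if risks[i] > risks[j]:
            concordant += 1
        elif risks[i] == risks[j]:
            concordant += 0.5
    return concordant / comparable

# three events and one censored observation; perfectly ordered risks give 1.0
print(concordance_index([2, 5, 7, 9], [1, 1, 0, 1], [0.9, 0.6, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking, which is why it pairs naturally with the Brier score and RMSE in the comparison above.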
Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.
Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employed a hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The random forest classifier using TF-IDF features achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). The hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.
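A minimal sketch of TF-IDF-weighted k-mer featurization, the sequence representation the classical models above rely on (toy sequences, not the study's pipeline):

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def tfidf_vectors(seqs, k=3):
    """Turn sequences into TF-IDF vectors over the corpus k-mer vocabulary."""
    counts = [kmer_counts(s, k) for s in seqs]
    n = len(seqs)
    df = Counter()                      # document frequency of each k-mer
    for c in counts:
        df.update(c.keys())
    vocab = sorted(df)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append([(c[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

vocab, X = tfidf_vectors(["ATGCGA", "ATGAAA", "CCCGGG"], k=3)
```

K-mers shared by every sequence get an IDF of zero, so the vectors emphasize variant-discriminating subsequences, which is the point of the representation.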
Atabaigi Elmi, V.; Joeres, R.; Kalinina, O. V.
Enzymes are essential catalysts in many cellular processes. Understanding their interactions with small molecules, such as regulators, cofactors, and, most importantly, substrates, is crucial for understanding the biochemical processes that occur in cells. Correctly interpreting the roles of small molecules that interact with enzymes is key to elucidating enzyme function. Recently, the field of enzyme-small molecule interaction prediction has attracted growing interest from computational and, especially, deep-learning research, and numerous datasets and models with remarkable performance have been published. In this work, we critically examine one of the most popular datasets and three models trained on it, identifying leaked information that may overinflate reported model performance. We show that the inspected models are susceptible to information leakage, and their performance drops to near-random when the leakage is removed.
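A minimal sketch of the kind of leakage audit described above: flagging test pairs whose exact pair, enzyme, or molecule already appears in training (identifiers are hypothetical):

```python
def leaked_pairs(train_pairs, test_pairs):
    """Report test (enzyme, molecule) pairs with potential train overlap,
    a common source of inflated performance estimates."""
    train_set = set(train_pairs)
    train_enzymes = {e for e, _ in train_pairs}
    train_mols = {m for _, m in train_pairs}
    report = {"exact": [], "enzyme_seen": [], "molecule_seen": []}
    for pair in test_pairs:
        e, m = pair
        if pair in train_set:
            report["exact"].append(pair)        # identical pair leaked
        elif e in train_enzymes:
            report["enzyme_seen"].append(pair)  # enzyme-level overlap
        elif m in train_mols:
            report["molecule_seen"].append(pair)
    return report

r = leaked_pairs([("E1", "M1"), ("E2", "M2")],
                 [("E1", "M1"), ("E1", "M3"), ("E3", "M2"), ("E3", "M4")])
```

Identifier overlap is only the crudest check; sequence- or structure-similarity clustering catches the subtler leakage the paper targets.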
Mukherjee, S.; Srivastava, D.; Patra, N.
Protein-DNA complexes are involved in vital cellular functions such as gene regulation, replication, transcription, packaging, rearrangement, and damage repair. In this work, a streamlined geometric formalism for computing the absolute binding free energy was used to obtain chemically accurate in silico estimates of the binding free energy of three protein-DNA complexes. The molecular interactions between protein and DNA involved hydrogen bonds and electrostatic, van der Waals, and hydrophobic interactions. Using this formalism, researchers can obtain the absolute binding free energy of a protein-DNA complex with remarkable accuracy and modest computational cost.
Guler, F.; Goksuluk, D.; Xu, M.; Choudhary, G.; Agraz, M.
Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic datasets generated from Naive Bayes, where MLP-based augmentation yielded a notable improvement in predictive performance. Building on this foundation, we applied the approach to chromophobe renal cell carcinoma (KICH) RNA-Seq data from The Cancer Genome Atlas (TCGA). Following standard preprocessing steps (normalization, transformation, and dimensionality reduction), the analysis concentrated on three main aspects: augmentation strategies, preprocessing methods, and explainable AI (XAI) techniques in relation to classification outcomes. Feature selection was performed through PCA, Boruta, and RF-based methods. Three augmentation strategies (linear interpolation, SMOTE, and MixUp) were evaluated. To maintain methodological rigor, augmentation was applied exclusively to the training set, while the test set was held out for unbiased evaluation. Within this framework, we conducted a comparative assessment of multiple deep learning architectures, including MLP, GNN, and the recently proposed Kolmogorov-Arnold networks (KAN). The GNN achieved the highest classification accuracy (99.47%) when trained with MixUp augmentation combined with RF feature selection, and achieved the best F1 score (0.9948). Consequently, the GNN-based XAI framework was applied to the RF dataset enriched with MixUp.
XAI analyses identified the top 20 most influential genes, such as HNF4A, DACH2, MAPK15, and NAT2, which played the greatest role in classification, thereby confirming the biological plausibility of the model outputs. To further validate model robustness, cervical cancer and Alzheimer's RNA-Seq datasets were also tested, yielding consistent and reliable results. Overall, the findings highlight the value of incorporating data augmentation into deep learning models for RNA-Seq analysis, not only to improve predictive performance but also to enhance biological interpretability through explainable AI approaches.
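A minimal sketch of MixUp applied to a training split only, as the pipeline above prescribes (pure Python, toy data; not the study's implementation):

```python
import random

def mixup(features, labels, alpha=0.2, n_new=100, seed=0):
    """Generate synthetic training rows as convex combinations of random
    pairs; applied to the training split only, never to the held-out test set."""
    rng = random.Random(seed)
    new_x, new_y = [], []
    for _ in range(n_new):
        i = rng.randrange(len(features))
        j = rng.randrange(len(features))
        lam = rng.betavariate(alpha, alpha)   # mixing coefficient ~ Beta(a, a)
        new_x.append([lam * a + (1 - lam) * b
                      for a, b in zip(features[i], features[j])])
        new_y.append(lam * labels[i] + (1 - lam) * labels[j])
    return new_x, new_y

x_aug, y_aug = mixup([[0.0, 1.0], [1.0, 0.0]], [0.0, 1.0], n_new=5)
```

Because labels are mixed too, the augmented targets are soft; models are trained against them with a cross-entropy that accepts fractional labels.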
Kusumoto, T.
In this study, we present an XAI-based genetic profiling framework that quantifies gene importance for distinguishing cancer cells from normal cells based on an interpretable AI decision process. We propose a new explainable AI (XAI) classification model that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving probabilistic interpretability. To demonstrate the capability of this framework, we used the GSE131907 single-cell lung cancer atlas and constructed a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors using the Nucleotide Transformer, after which the classification model was trained. We then extracted class-specific probabilistic contributions from the tractable model and defined a contribution score for the cancer-cell class. Genetic profiling was performed based on these scores, providing insights into which genes and biological pathways are most important for the classification task. Notably, 1,524 of the 9,540 observed genes showed contribution scores that contradicted what would be expected from their class-wise occurrence frequencies, suggesting that the profiling goes beyond simple statistics by leveraging biological feature representations encoded by the Nucleotide Transformer. The top-ranked genes among these contradictory cases include several well-studied genes in cancer research (e.g., ITGA5, SIGLEC9, NOTUM, and TP73). Overall, these analyses go beyond traditional statistical or gene-expression-level approaches and provide new academic insights for genetic research.
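The notion of a "contradictory" gene above can be made concrete. A minimal sketch under an assumed score convention (positive contribution favours the cancer class); the gene names and numbers are illustrative, not the study's values:

```python
def contradictory_genes(contrib, freq_cancer, freq_normal):
    """Genes whose model contribution score points the opposite way
    from their raw class-wise occurrence frequencies."""
    flagged = []
    for gene, score in contrib.items():
        freq_leans_cancer = freq_cancer[gene] > freq_normal[gene]
        model_leans_cancer = score > 0   # assumed sign convention
        if freq_leans_cancer != model_leans_cancer:
            flagged.append(gene)
    return flagged

contrib = {"ITGA5": 0.4, "TP73": -0.2, "GAPDH": 0.1}     # made-up model scores
freq_cancer = {"ITGA5": 0.1, "TP73": 0.6, "GAPDH": 0.5}  # occurrence in cancer cells
freq_normal = {"ITGA5": 0.3, "TP73": 0.2, "GAPDH": 0.4}  # occurrence in normal cells
flagged = contradictory_genes(contrib, freq_cancer, freq_normal)
```

Such disagreements are exactly where the learned representation adds information beyond occurrence statistics.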
Amin, R.; Rana, M. M. H.; Aktar, S.
Federated learning (FL) enables collaborative clinical model training without centralized data sharing, yet its deployment is hindered by statistical heterogeneity (non-IID data) and inherent class imbalance across healthcare institutions. Conventional aggregation strategies such as FedAvg and FedProx weight client updates solely by dataset size, ignoring class distributions and thereby biasing the global model toward the majority class. To address this, we propose Distribution-Aware Federated Learning (DA-FL), which introduces a minority-class amplification factor φ_k, computed as the ratio of a client's local positive-class rate to the global positive-class rate. Combined with class-weighted cross-entropy loss at the client level, DA-FL forms a two-level correction mechanism that mitigates imbalance without additional data sharing. Experiments on the CDC BRFSS 2021 diabetes dataset (236,378 records across five simulated clients under three non-IID levels) show that DA-FL improves F1-Macro by 18.2% and G-Mean by 26.7% over FedAvg under moderate non-IID conditions, while achieving 31-fold greater F1-Macro stability across 30 communication rounds. These findings demonstrate that DA-FL is an effective and practically deployable solution for federated clinical prediction under realistic non-IID and class-imbalanced settings.
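The amplification factor φ_k lends itself to a short sketch. Only the ratio definition comes from the abstract; scaling the size-based weights by φ_k and renormalising is an assumption for illustration:

```python
def dafl_weights(client_sizes, client_pos_rates):
    """Distribution-aware aggregation weights: scale each client's
    size-based weight by phi_k = local positive rate / global positive rate,
    then normalise so the weights sum to one (normalisation assumed)."""
    total = sum(client_sizes)
    # global positive rate = pooled positives / pooled samples
    global_pos = sum(n * p for n, p in zip(client_sizes, client_pos_rates)) / total
    raw = [n * (p / global_pos) for n, p in zip(client_sizes, client_pos_rates)]
    s = sum(raw)
    return [w / s for w in raw]

# two equally sized clients; the minority-rich client gets amplified
w = dafl_weights([1000, 1000], [0.05, 0.45])
```

With FedAvg both clients would get weight 0.5; here the client holding most of the positive (minority) class dominates the update, which is the intended correction.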
Muneeb, M.; Ascher, D.
Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first classifies people into cases/controls based on genotype data. The second calculates feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms to maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.
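The gene identification ratio reduces to a set overlap. The exact definition used here is an assumption (fraction of pipeline-selected genes confirmed in the GWAS Catalog), and the gene names are placeholders:

```python
def gene_identification_ratio(model_genes, catalog_genes):
    """Assumed definition for illustration: the fraction of genes selected
    by the ML/DL pipeline that also appear among the phenotype-associated
    genes in the GWAS Catalog."""
    selected = set(model_genes)
    return len(selected & set(catalog_genes)) / len(selected)

gir = gene_identification_ratio(["A", "B", "C", "D"], ["A", "B", "C", "X", "Y"])
```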
Koeksal, R.; Fritz, A.; Kumar, A.; Schmidts, M.; Tran, V. D.; Backofen, R.
Identifying genes associated with human diseases is essential for effective diagnosis and treatment. Experimentally identifying disease-causing genes is time-consuming and expensive. Computational prioritization methods aim to streamline this process by ranking genes based on their likelihood of association with a given disease. However, existing methods often report long ranked lists of thousands of potential disease genes, often containing many false positives. This fails to meet the practical needs of clinicians, who require shorter, more precise candidate lists. To address this problem, we introduce DisGeneFormer (DGF), an end-to-end disease-gene prioritization pipeline. Our approach is based on two distinct graph representations, modeling gene and disease relationships, respectively. Each graph is first processed separately by graph attention and then jointly by a transformer module that combines within-graph and cross-graph knowledge through local and global attention. We propose an evaluation pipeline based on the precision of a top-K ranked gene list, with K set to clinically feasible values between 5 and 50, relying solely on experimentally verified associations as ground truth. Our evaluation demonstrates that DGF substantially outperforms existing methods. We additionally assessed the influence of the negative-data sampling strategy and analysed the effect of graph topology and features on the performance of our model.
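The top-K precision evaluation described above reduces to a few lines (gene names are placeholders):

```python
def precision_at_k(ranked_genes, true_genes, k):
    """Fraction of the top-k ranked genes that are experimentally
    verified disease genes."""
    top = ranked_genes[:k]
    return sum(g in true_genes for g in top) / k

ranking = ["BRCA1", "TP53", "GENE_X", "EGFR", "GENE_Y"]
truth = {"BRCA1", "EGFR", "TP53"}
print(precision_at_k(ranking, truth, 5))  # 3 of the top 5 are verified -> 0.6
```

Restricting K to 5-50 is what aligns the metric with a clinician's short candidate list rather than a genome-wide ranking.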
Mukherjee, P.; Mandal, S.
This paper describes MMP, a three-stage framework for systematic quantum optimization of constrained molecular docking problems. The protocol addresses the "formulation bottleneck"--the critical challenge of translating constrained optimization problems into valid QUBO (Quadratic Unconstrained Binary Optimization) formulations for quantum solvers. MMP replaces heuristic penalty tuning with data-driven calibration through: (1) classical solution-space analysis to validate fragment libraries before quantum deployment, (2) systematic penalty sweeps to identify optimal "Goldilocks Zone" coefficients, and (3) MAC-QAOA (MMP Adaptive Constraint QAOA) with layer-dependent penalty decay. Preliminary benchmarks on synthetic constrained optimization problems demonstrate 99.7% solution validity at identified elbow points and a 25.5% improvement in solution quality over static-penalty QAOA. MMP is hardware-agnostic but designed for near-term devices, including Pasqal's Orion Gamma (140+ qubits). The theoretical framework, algorithmic details, and preliminary validation results of the protocol are discussed, establishing a systematic methodology for quantum-augmented optimization workflows for drug discovery. All benchmarks are conducted on synthetic constrained optimization instances that reproduce structural features of docking formulations; application to real molecular docking targets is left for future work.
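A minimal sketch of the penalty-formulation step: encoding a "pick exactly one option" constraint into a QUBO and brute-forcing a small instance to see when the penalty is too weak, in the spirit of the validity sweeps above (toy costs; not MMP itself):

```python
from itertools import product

def qubo_energy(Q, x):
    """Energy x^T Q x for a binary vector x (upper-triangular Q)."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def penalised_qubo(costs, penalty):
    """Encode 'pick exactly one of n options' as a QUBO:
    minimise sum_i c_i x_i + penalty * (sum_i x_i - 1)^2.
    Expanding the square (x_i^2 = x_i for binaries) gives diagonal terms
    c_i - penalty and off-diagonal terms 2*penalty (constant term dropped)."""
    n = len(costs)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = costs[i] - penalty
        for j in range(i + 1, n):
            Q[i][j] = 2 * penalty
    return Q

def brute_force_min(Q, n):
    """Exhaustively find the lowest-energy bitstring (fine for small n)."""
    return min(product([0, 1], repeat=n), key=lambda x: qubo_energy(Q, x))

costs = [3.0, 1.0, 2.0]
```

With a sufficiently large penalty the minimiser is the valid one-hot solution picking the cheapest option; with a tiny penalty the invalid all-zero string wins, which is exactly the failure mode a penalty sweep is meant to expose.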
Guan, J. S.; Wang, Z.; Mu, Y.
Protein-protein binding affinity is important for understanding protein interactions within a protein complex and for identifying strong drug-peptide binders to a target protein. Many structure-based models with reasonable performance have been built previously. However, such models require a protein complex structure as input, which is usually unavailable due to high cost and experimental constraints. To tackle this issue, the sequence-based CrossAffinity model was constructed in this study, using a cross-attention module to extract contextual information about interacting protein components while separating the protein complex into two distinct parts to predict protein-protein binding affinity. Despite being trained on an older dataset, CrossAffinity outperformed all structure-based and sequence-based models on the S34 test set, which contains newer protein complex structures and binding affinity values, demonstrating generalisability to new data points. On the other test sets, namely S90, the S90 subset, and S79*, CrossAffinity also outperformed all other sequence-based models while maintaining performance comparable to many recently published structure-based models. The acceptable performance and quick inference of CrossAffinity enable it to be deployed in situations requiring the prediction of binding affinity for many protein complexes that lack structural information.
Kumar, A.; Do, T. A.; Gruening, B.; Becker, H.; Backofen, R.
DNA methylation is a significant epigenetic modification involving the addition of a methyl group at the 5-position of cytosine residues. The modification is implicated in disease progression, immune response, and outcomes in diseases such as breast cancer (BC) and acute myeloid leukemia (AML). Illumina's HumanMethylation450 BeadChip (450K) and EPIC BeadChip (850K) methylation arrays are heavily used in such cancer studies to determine differentially expressed and differentially methylated genomic regions. Many of these are biomarkers used effectively for exploring therapeutic targets. Several studies report a few potential biomarkers, but the enormous number of largely unexplored probe-level (CpG site) methylation signals may contain additional significant biomarkers. To prioritise the under-explored and disease-specific CpG sites from DNA methylation arrays and potentially uncover novel biomarkers, we present GraphMeX-plain, a novel graph neural network (GNN)-based approach with an explainable AI module. The underlying graph neural network is a principal neighbourhood aggregation (PNA) network. The approach uses the biomarkers reported in recent studies to rank biomarkers from the unexplored set. A similarity graph between CpG sites (known and unexplored sets) is constructed using DNA methylation β values from arrays, producing an interaction graph. Biomarkers from recent studies are used as seeds, and from the unexplored CpG sites, highly variable ones (excluding the seeds) that vary significantly between conditions (BC patients and normal controls for breast cancer arrays) are selected.
Using the combination of seed and highly variable CpG sites, a positive-unlabeled approach, network-informed adaptive positive-unlabeled learning (NIAPU), is used to assign soft labels to unknown CpG sites (likely positive, weakly negative, likely negative, and reliable negative, in descending order of the likelihood of being a potential biomarker). The graph neural network, a multi-layer PNA, refines the soft label assignments and achieves a high weighted F1 classification score of 0.93 for BC and 0.91 for AML. The most likely set of CpG sites, classified as "likely positive", is further explored using GNNExplainer, an explainable AI approach. Subgraphs for likely positive CpG sites predicted with high probabilities are computed, and their proximities to the original seed CpG sites are analysed. The CpG sites predicted as likely positive interact closely with the seeds. The top likely positive CpG site for BC is cg13265740 (C6orf115), whose gene C6orf115 is strongly associated with BC. For AML, the top predicted likely positive CpG site is cg23281527 (KLHDC7A), whose gene KLHDC7A plays a strong role in the mechanism of AML. A high percentage of these likely positive CpG sites for both BC and AML, which remained unseen by the GNN model during training, are highly relevant to these diseases and can serve as potential therapeutic targets with prognostic value.
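The proximity analysis between predicted sites and seed biomarkers can be sketched as a multi-source BFS over the similarity graph (toy graph, hypothetical CpG identifiers):

```python
from collections import deque

def distance_to_seeds(graph, seeds):
    """Multi-source BFS over a CpG similarity graph: hop distance from the
    nearest seed biomarker, a crude proxy for the proximity analysis above.
    Nodes unreachable from any seed are absent from the result."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

g = {"seed1": ["cgA", "cgB"], "cgA": ["seed1", "cgC"],
     "cgB": ["seed1"], "cgC": ["cgA"], "cgD": []}
d = distance_to_seeds(g, ["seed1"])
```

Sites a short hop from the seeds are the natural "likely positive" candidates; isolated sites like cgD never receive a distance at all.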
Zhou, M.; Zhang, M.; Wang, J.; Shao, C.; Yan, G.
Cardiovascular disease is one of the leading causes of death worldwide, with myocardial infarction (MI) being a major cause of both morbidity and mortality among cardiovascular patients. MI patients face a higher risk of cardiovascular disease recurrence afterwards. Therefore, accurately predicting the risk of recurrence and identifying key risk factors are crucial for clinical decision-making. In this paper, we consider the interrelationships among cardiovascular factors from a systemic perspective. We first construct a differential network for each patient to capture individual-specific deviations in factor relationships, and we propose a novel method, termed Causal Factor-aware Graph Neural Network (CFGNN), which integrates factor interactions to predict the recurrence risk of MI patients while uncovering key risk factors from a causal perspective. Experimental results demonstrate that CFGNN performs well on real-world hospital-derived datasets, effectively identifying several key risk factors. This method not only deepens our understanding of cardiovascular disease but also paves the way for more targeted and effective interventions.
Misra, S.; Roy, S.; Ray, S. S.
Genes with similar expression profiles often exhibit similar functional properties. An "integrated similarity score" (ISS) is developed by combining different expression similarity measures through weights, obtained using biological information, to improve gene similarity. The expression similarity measures are converted to the common framework of positive predictive value using functional annotation. A fitness function, called "fitness function using functional annotation of genes" (FFFAG), is also developed by minimizing the difference between the functional similarity value and the ISS. The FFFAG is used to determine the weight combination of the different similarity measures in the ISS. In addition, an existing similarity measure, called TMJ (integrated similarity measure obtained by multiplying Triangle and Jaccard similarity), is modified to incorporate biological knowledge involving functional annotation. The results demonstrate that the ISS is superior to individual similarity measures at finding similar gene pairs. Further, the ISS predicts the functional categories of 40 unclassified yeast genes at a p-value cutoff of 10^-10 from 12 clusters. The associated code is accessible at http://www.isical.ac.in/~shubhra/ISS.html.
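A minimal sketch of the weighted-combination idea behind the ISS; the measure names and weights here are illustrative, not the paper's fitted values:

```python
def integrated_similarity(pair_scores, weights):
    """Integrated similarity score for one gene pair: a convex combination
    of several expression-similarity measures. In the paper the weights are
    fitted via the FFFAG fitness function; here they are just placeholders."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # convex combination
    return sum(weights[m] * s for m, s in pair_scores.items())

scores = {"pearson": 0.8, "spearman": 0.7, "euclidean_sim": 0.5}
weights = {"pearson": 0.5, "spearman": 0.3, "euclidean_sim": 0.2}
iss = integrated_similarity(scores, weights)
```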
Muneeb, M.; Ascher, D.; Myung, Y.; Feng, S.; Henschel, A.
Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and in precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype are passed to PLINK for quality control, after which they are transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality-controlled test data and the genome-wide association study summary statistics file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data were passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameter combinations. Machine learning outperformed for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give valuable insights into which techniques tend to perform better for certain phenotypes compared with more traditional polygenic risk score tools.
Uppaluri, K. R.; Challa, H. J.; Vempati, K. K.; Kadali, L. N.; Palasamudram, K.; Rayala, M.
Coronary artery disease (CAD) is a multifactorial condition influenced by genetic, phenotypic, and environmental factors. Traditional risk prediction models fall short in capturing the polygenic complexity of CAD, particularly in underrepresented populations. This study presents SIGMA (Scoring Importance of Genes specific to disease using Machine learning Algorithms), a novel AI-powered framework that enhances CAD risk prediction by integrating genomic and phenotypic data. Our approach leverages GEMS (GeneConnectRx Evidence Metrics), an LLM-driven system to score 1772 CAD-associated genes, and CASCADE (Comprehensive Assessment of Sequence and Clinical Annotation Data Evaluation), a tiered variant scoring pipeline. Using whole exome sequencing (WES) data from 1,243 individuals (628 controls, 615 CAD cases), the model integrates age and gender as key non-modifiable phenotypes. Results show significant improvements in sensitivity (from 0.41 to 0.79), specificity (0.70 to 0.72), and AUC (0.59 to 0.81) when phenotype data are incorporated. Our findings highlight the potential of AI-integrated genomics for population-specific CAD risk stratification.
Duarte, S. A.; Mehdiabadi, M.; Bugnon, L. A.; Aspromonte, M. C.; Piovesan, D.; Milone, D. H.; Tosatto, S.; Stegmayer, G.
Intrinsically disordered proteins (IDPs) play an important role in a wide range of biological functions and are linked to several diseases. Due to technical difficulties and the high cost of experimental determination of disorder in proteins, combined with the exponential increase of unannotated protein sequences, the development of computational methods for disorder prediction has become an active area of research in the last few decades. In this work, we present emb2dis, a deep learning model that uses protein language models (pLMs) to predict disorder from sequence. The emb2dis tool is a pre-trained model that receives a protein sequence as input, calculates its pLM embedding, and passes it to a deep learning model. In contrast to existing approaches, emb2dis integrates informative sequence representations with a novel architecture that combines residual networks (ResNets) and dilated convolutions. This design effectively enlarges the receptive field of the convolution operation, enabling the model to better capture an extended context of each amino acid. At the output, emb2dis assigns a disorder propensity score to each residue in the sequence. The model was evaluated on datasets from the latest CAID3 blind benchmark for disorder prediction, where it achieved first place in the Disorder-PDB category, exhibiting strong performance with high AUC and Fmax scores. Additionally, it ranked among the top ten methods on the Disorder-NOX dataset. We provide a freely available web demo for emb2dis (https://sinc.unl.edu.ar/web-demo/emb2dis/) and a source code repository for local installation. The importance of the emb2dis tool is that it provides a new deep learning approach and significant improvements in the prediction of protein disorder, with a simple web interface and graphical output detailing per-residue disorder.
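The receptive-field claim about dilated convolutions is easy to verify with arithmetic: for stride-1 layers, each layer with kernel size k and dilation d adds (k - 1) * d positions.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, dilation) 1-D conv
    layers with stride 1: each layer adds (k - 1) * d positions."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# four plain 3-wide convs vs. the same depth with doubling dilations
plain = [(3, 1)] * 4
dilated = [(3, 1), (3, 2), (3, 4), (3, 8)]
print(receptive_field(plain), receptive_field(dilated))  # 9 vs. 31
```

Doubling dilations grow the context window exponentially with depth at no extra parameter cost, which is why the architecture can see far-away residues.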
Kumar, S.; Zambreno, J.; Khokhar, A.; Akram, S.; Saeed, F.
Improving the speed and efficiency of database search algorithms that deduce peptides from mass spectrometry (MS) data has been an active area of research for more than three decades. The need for faster database search methods has rapidly increased due to the growing interest in studying non-model organisms, meta-proteomics, and proteogenomic data, which are notorious for their enormous search spaces. The poor scalability of serial algorithms with growing database size and increasing post-translational modification parameters is a widely recognized problem. While high-performance computing techniques can be used on supercomputing machines, the need for real-time, on-the-instrument solutions necessitates the development of an efficient system-on-chip that optimizes design constraints such as cost, performance, and power. To showcase that such a system can work, we present an FPGA-based computational framework called FiCOPS to accelerate database search using a hardware/software co-design methodology. First, we theoretically analyze the database-search algorithm (closed search) to reveal opportunities for parallelism and uncover computational bottlenecks. We then design an FPGA-based architectural template to exploit the parallelism inherent in the search workload. We also formulate an analytical performance model for the architectural template to perform rapid design space exploration and find a near-optimal accelerator configuration. Finally, we implement our design on the Intel Stratix 10 FPGA platform and evaluate it using real-world datasets. Our experiments demonstrate that FiCOPS achieves a 3.5x speed-up over existing CPU solutions and 3x and 5x reductions in power consumption compared to existing CPU and GPU solutions, respectively.
Surkanti, S. R.; Kasturi, V. V.; Saligram, S. S.; Basangari, B. C.; Kondaparthi, V.
RNA interference (RNAi) is a crucial biological post-transcriptional gene silencing mechanism in which small interfering RNA (siRNA) guides the RNA-induced silencing complex (RISC) to bind messenger RNA (mRNA), thereby silencing it and stopping protein formation. We exploit this process to prevent the formation of harmful proteins by silencing mRNA before it is translated into protein through an effective siRNA. There is a need for a computational model that predicts the effectiveness of an siRNA on a given mRNA. Designing such a model is challenging, as the available data are either scarce or biased, and existing models lack generalization ability even though the ratio of parameters to training samples is very high. To overcome these challenges, we introduce RNAiSpline, which incorporates self-supervised pretraining and fine-tuning with a Kolmogorov-Arnold Network (KAN), a Convolutional Neural Network (CNN), and a Transformer encoder. Evaluation on the independent test dataset yields an ROC-AUC of 0.8175, an F1 score of 0.7717, and a Pearson correlation of 0.6032, making RNAiSpline a robust model for siRNA efficacy prediction.
Zhang, X.; Fang, Z.; Tang, K.; Chen, H.; Li, J.
Targeted drug therapies offer a promising approach for treating complex diseases, with combinational drug therapies often employed to enhance therapeutic efficacy. However, unintended drug-drug interactions (DDIs) may undermine treatment outcomes or cause adverse side effects. In this work, we propose a novel joint learning framework for the simultaneous prediction of effective drug combinations and drug-drug interactions, based on coupled tensor-tensor factorization. Specifically, we model drug combination therapies and DDIs by representing drug-drug-disease associations and drug-drug interaction profiles as coupled three-way tensors. To address the challenges of data incompleteness and sparsity, the proposed model integrates auxiliary drug similarity information, such as chemical structure similarities, drug-specific side effects, drug target profiles, and drug inhibition data on cancer cell lines, within a multi-view learning framework. For optimization, we adopt a modified Alternating Direction Method of Multipliers (ADMM) algorithm that ensures convergence while enforcing non-negativity constraints. In addition to standard tensor completion tasks, we further evaluate the proposed method under a more realistic new-drug prediction setting, where all interactions involving a previously unseen drug are withheld. This scenario closely aligns with real-world applications, in which reliable predictions for emerging or under-studied compounds are essential. We evaluate the proposed method, SI-ADMM, on a comprehensive dataset compiled from multiple sources, including DrugBank, CDCDB, SIDER, and PubChem. Our experiments show that SI-ADMM maintains robust performance and achieves the best results compared with other tensor factorization approaches, with or without auxiliary information, particularly in the new-drug prediction setting. The implementation of our method is publicly available at: https://github.com/Xiaoge-Zhang/SI-ADMM.